Abstract

Recently, the $\ell_{2,1}$ matrix norm has been widely applied in many areas such as
computer vision, pattern recognition, and biological study. As an extension of the
$\ell_1$ vector norm, the mixed $\ell_{2,1}$ matrix norm is often used to find jointly sparse
solutions. Moreover, an efficient iterative algorithm has been designed to solve
$\ell_{2,1}$-norm involved minimizations. Computational studies have shown that
$\ell_p$-regularization ($0 < p < 1$) yields sparser solutions than $\ell_1$-regularization,
but the extension to matrix norms has seldom been considered. This paper presents a
definition of the mixed $\ell_{2,p}$ ($p \in (0,1]$) matrix pseudo norm, which can be regarded
as a generalization both of the $\ell_p$ vector norm to matrices and of the $\ell_{2,1}$-norm to
nonconvex cases ($0 < p < 1$). An efficient unified algorithm is proposed to solve the
induced $\ell_{2,p}$-norm ($p \in (0,1]$) optimization problems, and its convergence is
demonstrated uniformly for all $p \in (0,1]$. Typical values of $p \in (0,1]$ are applied to
select features in computational biology, and the experimental results show that some
choices of $0 < p < 1$ do improve the sparse pattern obtained with $p = 1$.

1 Introduction 

In many fields, such as computer vision, pattern recognition, and computational biology,
the mixed $\ell_{2,1}$ matrix norm has received increasing attention for its joint sparsity
pattern. In multi-task feature learning, the authors of [15] and [2] have proposed
similar models using $\ell_{2,1}$-norm regularization to couple feature selection across
tasks. But the approach to solve this problem proposed in [23] has no known convergence
rate. Liu et al. [12] reformulate the nonsmooth $\ell_{2,1}$-norm regularized optimization as
two smooth convex optimization problems, then apply Nesterov's method to solve them.
This algorithm either computes the solution analytically or converges globally to the
solution in linear time. Recently, a proximal alternating direction method was proposed
in [26] to solve the $\ell_{2,1}$-norm regularized least squares problem for multi-task feature
learning. The $\ell_{2,1}$-norm involved minimization has also been successfully employed in
correlated attribute transfer with multi-task graph-guided fusion [27] and nonnegative
graph embedding [30]. Moreover, the authors of [24] have used spectral regression with
an $\ell_{2,1}$-norm constraint to evaluate features jointly. The group Lasso [20, 25] and the
logistic group Lasso [14] are constructed with $\ell_{2,1}$-norm regularization in many
applications.

One major challenge of $\ell_{2,1}$-norm minimization is how to efficiently solve this
nonsmooth optimization problem. The authors of [1] propose a direct iterative algorithm
to solve the robust $\ell_{2,1}$-norm minimization of both loss function and regularization,
and prove its global convergence in the same work. The algorithm has been widely used
in many applications for its efficiency and simple construction, for example in
[28], [29]. It has also been adapted to unsupervised feature selection [2], [31] and
semi-supervised learning [13]. Spatial group sparse coding in image-level tagging [22]
and multi-instance learning [17] also employ a similar technique.

On the whole, all the models and algorithms mentioned above are constructed in the
convex $\ell_1$-norm framework. Extensive computational studies [?], [5], [6], [19] have
shown that using the $\ell_p$-norm ($0 < p < 1$) can find sparser solutions than using the
$\ell_1$-norm. Naturally, one can expect $\ell_{2,p}$-norm ($0 < p < 1$) based minimization to
yield a better sparsity pattern than the $\ell_{2,1}$-norm. Recently, a similar $\ell_p$-$\ell_q$
($0 < p \le 1$, $1 \le q \le 2$) penalty for sparse linear and multiple kernel multi-task
learning has been considered in [?]. But the induced optimization problems have to be
solved separately by different algorithms according to the convex ($p = 1$) and
nonconvex ($0 < p < 1$) cases. This disadvantage makes it computationally difficult to
vary $p$ and $q$ freely. In this paper, we define a mixed $\ell_{2,p}$ ($p \in (0,1]$) matrix
norm¹ and present a unified algorithm to solve the involved $\ell_{2,p}$-norm based
minimizations for all $p \in (0,1]$. To the best of our knowledge, it is the first
algorithm to uniformly solve these mixed convex and nonconvex optimization problems.
The presentation has several innovations, as follows. 1) It is a generalization of
$\ell_{2,1}$-norm regularization to the nonconvex case. Since the $\ell_p$-norm ($0 < p < 1$) is
neither convex nor Lipschitz continuous, the induced $\ell_{2,p}$-norm based optimization
problem is also nonconvex and non-Lipschitz continuous. 2) Since $\ell_{2,p}$-norm
($p \in (0,1]$) based functions are neither convex nor Lipschitz continuous except for
$p = 1$, efficiently solving the mixed problem is much more challenging than pure
$\ell_{2,1}$-norm minimization. Here we extend the existing work in [1] to a unified
algorithm solving all the $\ell_{2,p}$-norm ($p \in (0,1]$) optimization problems. If $p = 1$,
the general algorithm reduces to the case of [1]. If $0 < p < 1$, the unified algorithm
finds a local approximate solution to the nonconvex $\ell_{2,p}$-norm minimization. The
convergence is proved uniformly for all $p \in (0,1]$. 3) Typical $p \in (0,1]$ are tested
in $\ell_{2,p}$-norm based objective functions. The experiments in bioinformatics provide
empirical evidence that some $0 < p < 1$ are alternatives in constructing sparsity
patterns, while $p = 0.5$ clearly outperforms $p = 1$.

¹ $\|\cdot\|_{2,p}$ ($0 < p < 1$) is not a valid matrix norm because it does not satisfy the
triangle inequality. Here we call it a matrix norm for convenience.

2 Notations and Definitions



We employ the usual notation. Matrices are written as boldface uppercase letters
while vectors are written as boldface lowercase letters. For example,
$A = (a_{ij})_{m \times c}$ denotes a real $m \times c$ matrix, and $a^i$ ($i = 1, \dots, m$) and
$a_j$ ($j = 1, \dots, c$) are the $i$-th row and $j$-th column of $A$ respectively.

For any $x \in R^m$, several useful vector norms are defined as follows,

    $$\|x\|_0 = \sum_{x_i \neq 0} |x_i|^0, \quad \|x\|_1 = \sum_{i=1}^{m} |x_i|, \quad \|x\|_p = \Big(\sum_{i=1}^{m} |x_i|^p\Big)^{1/p}, \qquad (1)$$

where $p \in (0,1)$. Actually, neither $\ell_0$ nor $\ell_p$ ($0 < p < 1$) is a well-defined norm,
because the former does not satisfy positive homogeneity and the latter does not
satisfy the triangle inequality. Here we call them norms for simplicity.

The $\ell_{2,1}$-norm of a matrix was first introduced in [8]; it is a true matrix norm
satisfying the norm axioms,

    $$\|A\|_{2,1} = \sum_{i=1}^{m} \|a^i\|_2. \qquad (2)$$

It is well known that $\|\cdot\|_{2,1}$ is convex with respect to the matrix variable. Now we
generalize the definition of the $\ell_{2,1}$-norm to the mixed $\ell_{2,p}$-norm as follows

    $$\|A\|_{2,p} = \Big(\sum_{i=1}^{m} \|a^i\|_2^p\Big)^{1/p}, \quad p \in (0,1]. \qquad (3)$$

Obviously, the $\ell_{2,p}$-norm reduces to the $\ell_{2,1}$-norm when $p = 1$. Note that the $\ell_p$
($0 < p < 1$) pseudo norm does not satisfy the triangle inequality on $R^m$, so the
corresponding $\ell_{2,p}$-norm is not a valid matrix norm because

    $$\|A + B\|_{2,p} \nleq \|A\|_{2,p} + \|B\|_{2,p}, \quad A, B \in R^{m \times c}.$$

Moreover, the $\ell_p$ ($0 < p < 1$) vector norm is neither convex nor Lipschitz continuous,
so the $\ell_{2,p}$ matrix pseudo norm is not convex or Lipschitz continuous either. These
properties make it challenging to uniformly solve the mixed convex and nonconvex
$\ell_{2,p}$-norm ($p \in (0,1]$) based optimization problems.
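
The definition in (3) and the failure of the triangle inequality can be checked
numerically. The following is a minimal sketch in plain Python; the function name
`l2p_norm` and the two toy matrices are illustrative assumptions, not part of the paper.

```python
import math

def l2p_norm(A, p):
    """Mixed l2,p pseudo norm of a matrix given as a list of rows.

    Computes (sum_i ||a^i||_2^p)^(1/p) for p in (0, 1], following definition (3).
    """
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    row_norms = [math.sqrt(sum(v * v for v in row)) for row in A]
    return sum(r ** p for r in row_norms) ** (1.0 / p)

# Two hypothetical 2x1 matrices whose nonzero rows do not overlap.
A = [[1.0], [0.0]]
B = [[0.0], [1.0]]
A_plus_B = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

p = 0.5
lhs = l2p_norm(A_plus_B, p)             # (1 + 1)^(1/0.5) = 4
rhs = l2p_norm(A, p) + l2p_norm(B, p)   # 1 + 1 = 2
print(lhs, rhs, lhs > rhs)
```

With $p = 0.5$ the left-hand side exceeds the right-hand side, whereas with $p = 1$ the
same matrices give equality, in line with the $\ell_{2,1}$-norm being a true norm.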



3 $\ell_{2,p}$-Norm Based Minimizations

Given observation data $\{a_1, a_2, \dots, a_n\} \subset R^d$ and corresponding outputs
$\{b_1, b_2, \dots, b_n\} \subset R^c$, a general principled framework in many areas is to
consider

    $$\min_X \; loss(X) + \alpha R(X), \qquad (4)$$

where $loss(X)$ and $R(X)$ denote the loss function and the regularization term
respectively, and $\alpha > 0$ is the regularization parameter. Different $loss(X)$ and
$R(X)$ are chosen for a variety of data distributions and practical applications.
Traditional least squares regression solves the following optimization problem to
obtain the unknown matrix $X \in R^{d \times c}$:

    $$\min_X \sum_{i=1}^{n} \|X^T a_i - b_i\|_2^2 + \alpha R(X), \qquad (5)$$

where $X$ contains the projection matrix and bias vector for simplicity.

It is well known that the squared-norm residual is sensitive to outliers, hence Nie et
al. [1] propose to use the robust $\ell_{2,1}$-norm loss function

    $$\min_X \sum_{i=1}^{n} \|X^T a_i - b_i\|_2 + \alpha R(X). \qquad (6)$$

Here we expect to use the generalized one

    $$\min_X \sum_{i=1}^{n} \|X^T a_i - b_i\|_2^p + \alpha R(X), \quad p \in (0,1]. \qquad (7)$$

For any $p \in (0,1]$, the noise magnitude of a distant outlier in (7) is no more than that
in (6). Thus model (7) is expected to be more robust than (6).
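
As a small worked illustration of this robustness argument, consider a single distant
outlier with residual norm $r = \|X^T a_i - b_i\|_2$. The value $r = 100$ below is a
hypothetical choice, not a number taken from the experiments.

```latex
% Contribution of one outlier with residual norm r = 100 (hypothetical value)
\begin{align*}
\text{squared loss (5):}\quad & r^2 = 10^4, \\
\text{robust loss (6):}\quad & r = 10^2, \\
\text{generalized loss (7), } p = 0.5:\quad & r^{p} = 10.
\end{align*}
% In general r^p \le r whenever r \ge 1 and p \in (0,1].
```

For residuals below 1 the ordering reverses ($r^p \ge r$), so the comparison concerns
large residuals only.
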
A joint sparse regularization $R(X)$ is usually chosen as

    $$R_1(X) = \sum_{i=1}^{d} \|x^i\|_2 \quad \text{or} \quad R_0(X) = \sum_{i=1}^{d} \|x^i\|_2^0. \qquad (8)$$

Theoretically, $R_0(X)$ is mostly preferred for its desirable sparsity, but $R_1(X)$ is
chosen more often in practice for computational reasons. Under certain conditions,
$R_1(X)$-regularization is equivalent to $R_0(X)$-regularization. Here we choose the
intermediate between $\ell_0$ and $\ell_1$ in the sense

    $$R_p(X) = \sum_{i=1}^{d} \|x^i\|_2^p, \quad p \in (0,1). \qquad (9)$$

Hence the $\ell_{2,p}$-norm based feature selection is reduced to a nonconvex and
non-Lipschitz continuous optimization problem

    $$\min_X \sum_{i=1}^{n} \|X^T a_i - b_i\|_2^p + \alpha \sum_{i=1}^{d} \|x^i\|_2^p, \qquad (10)$$

where $\alpha = \gamma^p$ is the regularization parameter. If the $\ell_{2,1}$-norm based objective
is unified into (10), it becomes a mixed minimization,

    $$\min_X \sum_{i=1}^{n} \|X^T a_i - b_i\|_2^p + \gamma^p \sum_{i=1}^{d} \|x^i\|_2^p, \quad p \in (0,1]. \qquad (11)$$

When $p = 1$, problem (11) reduces to the popular $\ell_{2,1}$-norm based minimization
proposed in [1]. But if $0 < p < 1$, (11) is nonconvex, hence the algorithm in [1]
cannot be directly applied. As far as we know, very few schemes have been presented to
uniformly solve this mixed problem. Therefore, it is necessary to develop a unified
approach to efficiently solve problem (11) for all $p \in (0,1]$.

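Denote $A = [a_1, a_2, \dots, a_n] \in R^{d \times n}$ and
$B = [b_1, b_2, \dots, b_n]^T \in R^{n \times c}$; the objective of problem (10) can be
written as

    $$J(X) = \sum_{i=1}^{n} \|X^T a_i - b_i\|_2^p + \gamma^p \sum_{i=1}^{d} \|x^i\|_2^p
           = \sum_{i=1}^{n} \|(A^T X - B)^i\|_2^p + \gamma^p \|X\|_{2,p}^p
           = \|A^T X - B\|_{2,p}^p + \gamma^p \|X\|_{2,p}^p. \qquad (12)$$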



4 Main Results 

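Obviously, problem (11) is equivalent to

    $$\min_X \frac{1}{\gamma^p} \|A^T X - B\|_{2,p}^p + \|X\|_{2,p}^p. \qquad (13)$$

Let $E = \frac{1}{\gamma}(A^T X - B)$; then the unconstrained optimization problem (13)
becomes

    $$\min_{E,X} \|E\|_{2,p}^p + \|X\|_{2,p}^p, \quad \text{s.t.} \; A^T X - \gamma E = B. \qquad (14)$$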



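It can be easily proved that

    $$\left\| \begin{bmatrix} X \\ E \end{bmatrix} \right\|_{2,p}^p = \|X\|_{2,p}^p + \|E\|_{2,p}^p.$$

If we denote

    $$Y = \begin{bmatrix} X \\ E \end{bmatrix} \in R^{m \times c} \quad \text{and} \quad M := [A^T \;\; -\gamma I_n] \in R^{n \times m}, \qquad (15)$$

where $m = d + n$ and $I_n$ is the identity matrix, then problem (14) can be reformulated
as

    $$\min_Y \|Y\|_{2,p}^p, \quad \text{s.t.} \; MY = B. \qquad (16)$$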
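
The chain of reformulations from the objective $J(X)$ to the constrained problem in $Y$
can be sanity-checked on a toy instance. The sketch below uses plain Python lists; the
data values, `gamma`, and the helper names `matmul` and `l2p_pow` are hypothetical
choices made only for illustration.

```python
import math

def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def l2p_pow(Y, p):
    """||Y||_{2,p}^p, i.e. the sum of the p-th powers of the row l2 norms."""
    return sum(math.sqrt(sum(v * v for v in row)) ** p for row in Y)

# Hypothetical toy data: d = 3 features, n = 2 samples, c = 2 outputs.
A = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]    # d x n, column i is sample a_i
B = [[1.0, 0.0], [0.0, 1.0]]                # n x c, row i is b_i^T
X = [[0.5, -1.0], [0.0, 2.0], [1.0, 0.0]]   # d x c
gamma, p = 2.0, 0.5
n = len(B)

At = [list(col) for col in zip(*A)]         # A^T, n x d
residual = [[u - v for u, v in zip(ru, rv)] for ru, rv in zip(matmul(At, X), B)]
E = [[v / gamma for v in row] for row in residual]          # E = (A^T X - B) / gamma
Y = X + E                                                   # stack X on top of E
M = [At[i] + [-gamma if j == i else 0.0 for j in range(n)]  # M = [A^T, -gamma I_n]
     for i in range(n)]

MY = matmul(M, Y)
J = l2p_pow(residual, p) + gamma ** p * l2p_pow(X, p)       # objective J(X)
print(MY)                              # should reproduce B
print(J / gamma ** p, l2p_pow(Y, p))   # the two values should agree
```

The check confirms that any $X$ yields a feasible $Y$ with $MY = B$, and that
$J(X)/\gamma^p$ coincides with $\|Y\|_{2,p}^p$, so minimizing one is the same as minimizing
the other.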



Problem (16) is not a convex optimization problem except for $p = 1$, so the solution
to (16) ($0 < p < 1$) is a local minimum. The Lagrangian function of the minimization
with linear constraints is

    $$L(Y, \Lambda) = \|Y\|_{2,p}^p - Tr(\Lambda^T (MY - B)), \qquad (17)$$

where $\Lambda \in R^{n \times c}$ is the Lagrangian multiplier matrix, and $Tr(\cdot)$ stands for
the trace operator. $Y_*$ is a KKT point of problem (16) if and only if there exists a
$\Lambda_* \in R^{n \times c}$ such that

    $$\frac{\partial L}{\partial Y} = 2 D_* Y_* - M^T \Lambda_* = 0, \quad M Y_* = B, \qquad (18)$$

where

    $$D_* = diag\Big( \frac{p}{2\|y_*^1\|_2^{2-p}}, \dots, \frac{p}{2\|y_*^m\|_2^{2-p}} \Big) \qquad (19)$$

is induced from $Y_*$, with $y_*^i$ denoting the $i$-th row of $Y_*$. After simple
reformulation, (18) is equivalent to

    $$Y_* = D_*^{-1} M^T (M D_*^{-1} M^T)^{-1} B. \qquad (20)$$

If $M$ has full-column rank, then $Y_*$ satisfying (20) is a local minimum of problem
(16).
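Then an iterative algorithm to solve equation (20) can be designed as follows.

Algorithm 4.1. (Solving Problem (16))

1. Input $M \in R^{n \times m}$, $B \in R^{n \times c}$ and $p \in (0,1]$.
2. Set $k = 0$ and initialize $D_0 = I_m$.
3. Iterate: for $k = 1, 2, \dots$ until convergence do:

       $$Y_k = D_{k-1}^{-1} M^T (M D_{k-1}^{-1} M^T)^{-1} B,$$

   Update $D_k$ with diagonal entries:

       $$d_{ii}^k = \frac{p}{2\|y_k^i\|_2^{2-p}}, \quad i = 1, \dots, m.$$

□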
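
The iteration of Algorithm 4.1 can be sketched in dependency-free Python as follows.
This is an illustrative sketch under stated assumptions, not the authors'
implementation: the function names, the toy matrices, the fixed iteration count, and
in particular the small perturbation `eps` inside the weight update (one possible
safeguard against zero rows) are all assumptions. In practice a linear algebra library
would replace the hand-written solver.

```python
def solve(S, R):
    """Solve S Z = R by Gaussian elimination with partial pivoting (S nonsingular)."""
    n, c = len(S), len(R[0])
    aug = [list(S[i]) + list(R[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for j in range(col, n + c):
                aug[r][j] -= f * aug[col][j]
    Z = [[0.0] * c for _ in range(n)]
    for i in reversed(range(n)):
        for j in range(c):
            s = aug[i][n + j] - sum(aug[i][k] * Z[k][j] for k in range(i + 1, n))
            Z[i][j] = s / aug[i][i]
    return Z

def smoothed_objective(Y, p, eps):
    """sum_i (||y^i||_2^2 + eps)^(p/2); equals ||Y||_{2,p}^p when eps = 0."""
    return sum((sum(v * v for v in row) + eps) ** (p / 2.0) for row in Y)

def l2p_minimize(M, B, p, iters=50, eps=1e-10):
    """Iterate Y_k = D^{-1} M^T (M D^{-1} M^T)^{-1} B with reweighted diagonal D."""
    n, m, c = len(M), len(M[0]), len(B[0])
    d_inv = [1.0] * m                                   # D_0 = I_m
    history = []
    for _ in range(iters):
        # S = M D^{-1} M^T, an n x n symmetric positive definite matrix
        S = [[sum(M[i][k] * d_inv[k] * M[j][k] for k in range(m))
              for j in range(n)] for i in range(n)]
        Z = solve(S, B)
        # Y = D^{-1} M^T Z
        Y = [[d_inv[k] * sum(M[i][k] * Z[i][j] for i in range(n))
              for j in range(c)] for k in range(m)]
        history.append(smoothed_objective(Y, p, eps))
        # 1 / d_ii with d_ii = p / (2 (||y^i||^2 + eps)^((2-p)/2)); eps is assumed
        d_inv = [(2.0 / p) * (sum(v * v for v in row) + eps) ** ((2.0 - p) / 2.0)
                 for row in Y]
    return Y, history

# Hypothetical toy problem: M = [A^T, -gamma I_n] with d = 3, n = 2, gamma = 1.
M = [[1.0, 2.0, 1.0, -1.0, 0.0],
     [1.0, 1.0, 3.0, 0.0, -1.0]]
B = [[1.0, 0.0],
     [0.0, 1.0]]
Y, history = l2p_minimize(M, B, p=0.5)
print(history[0], history[-1])
```

Only $D^{-1}$ is stored, since the update never needs $D$ itself. The recorded history
is the `eps`-smoothed objective, which is the quantity that the reweighting step is
guaranteed not to increase when the perturbation is used.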

Remark 4.1. If $D$ and $Y$ are computed as in (19) and (20), it can be easily derived
that

    $$Tr(Y^T D Y) = \frac{p}{2}\|Y\|_{2,p}^p.$$

Remark 4.2. If $y_k^i = 0$ happens in some iteration, then $D_k$ cannot be well updated
and Algorithm 4.1 breaks down. Here we employ techniques similar to those in [1] to
overcome it. One choice is to set the $i$-th diagonal element of $D_k$ explicitly in that
case. Another way is to introduce a perturbation $\epsilon$ into the update so that
$d_{ii}^k$ stays well defined and nonzero.

Now, let us show the convergence of Algorithm 4.1. Actually, $\|Y_k\|_{2,p}^p$
monotonically decreases with respect to the iterations.

Lemma 4.1. If $\varphi_p(t) = \frac{2}{2-p} t - \frac{p}{2-p} t^{2/p} - 1$, where $p \in (0,1]$,
then for any $t > 0$, $\varphi_p(t) \le 0$.

Proof. Taking the derivative of $\varphi_p(t)$ with respect to $t$ and setting it to zero,
that is

    $$\varphi_p'(t) = \frac{2}{2-p}\big(1 - t^{\frac{2}{p}-1}\big) = 0,$$

we obtain the unique stationary point $t = 1$ on $(0, +\infty)$. Since $\varphi_p'(t) > 0$ for
$0 < t < 1$ and $\varphi_p'(t) < 0$ for $t > 1$, $t = 1$ is the maximum point. Hence

    $$\varphi_p(t) \le \varphi_p(1) = 0, \quad t > 0.$$

□
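
The inequality of Lemma 4.1 is easy to probe numerically. The sketch below assumes the
form $\varphi_p(t) = \frac{2}{2-p} t - \frac{p}{2-p} t^{2/p} - 1$ and evaluates it on a
grid; the grid range and the function name `phi` are arbitrary illustrative choices.

```python
def phi(t, p):
    """phi_p(t) = 2/(2-p) * t - p/(2-p) * t^(2/p) - 1, assumed form of Lemma 4.1."""
    return 2.0 / (2.0 - p) * t - p / (2.0 - p) * t ** (2.0 / p) - 1.0

grid = [0.01 * k for k in range(1, 301)]    # t in (0, 3]
worst = {}
for p in (0.25, 0.5, 0.75, 1.0):
    # Largest value of phi_p on the grid; it should not exceed 0 beyond rounding.
    worst[p] = max(phi(t, p) for t in grid)
    print(p, worst[p])
```

For $p = 1$ the function is simply $-(t-1)^2$, which makes the maximum at $t = 1$
immediate; the other values of $p$ show the same behavior on the grid.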

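Lemma 4.2. Suppose that $y_k^i$ and $y_{k+1}^i$ are the $i$-th rows of $Y_k$ and $Y_{k+1}$
generated by Algorithm 4.1 respectively, then for $p \in (0,1]$

    $$\|y_{k+1}^i\|_2^p - \frac{p\|y_{k+1}^i\|_2^2}{2\|y_k^i\|_2^{2-p}} \le \|y_k^i\|_2^p - \frac{p\|y_k^i\|_2^2}{2\|y_k^i\|_2^{2-p}}, \quad i = 1, \dots, m. \qquad (21)$$

Equality in (21) holds if and only if $\|y_{k+1}^i\|_2 = \|y_k^i\|_2$.

Proof. Let $t = \|y_{k+1}^i\|_2^p / \|y_k^i\|_2^p$ in $\varphi_p(t)$; then $\varphi_p(t) \le 0$ by
Lemma 4.1, that is

    $$\frac{2}{2-p}\frac{\|y_{k+1}^i\|_2^p}{\|y_k^i\|_2^p} - \frac{p}{2-p}\frac{\|y_{k+1}^i\|_2^2}{\|y_k^i\|_2^2} - 1 \le 0. \qquad (22)$$

Note that $\|y_{k+1}^i\|_2 = \|y_k^i\|_2$ is sufficient and necessary for equality in (22).
Multiplying both sides of (22) by $(1 - \frac{p}{2})\|y_k^i\|_2^p$, we have

    $$\|y_{k+1}^i\|_2^p - \frac{p\|y_{k+1}^i\|_2^2}{2\|y_k^i\|_2^{2-p}} - \Big(1 - \frac{p}{2}\Big)\|y_k^i\|_2^p \le 0,$$

which is an equivalent form of (21). □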

Theorem 4.1. $\|Y_k\|_{2,p}^p$ generated by Algorithm 4.1 monotonically decreases with
respect to the iteration $k$. So it converges to a KKT point of problem (16), which is
also a local minimum of (16) if $M$ has full-column rank.

Proof. From Remark 4.1 and the construction of Algorithm 4.1, we can easily verify

    $$Y_{k+1} = \arg\min_{MY = B} Tr(Y^T D_k Y). \qquad (24)$$

So we have

    $$Tr(Y_{k+1}^T D_k Y_{k+1}) \le Tr(Y_k^T D_k Y_k), \qquad (25)$$

which is to say

    $$\sum_{i=1}^{m} \frac{p\|y_{k+1}^i\|_2^2}{2\|y_k^i\|_2^{2-p}} \le \sum_{i=1}^{m} \frac{p\|y_k^i\|_2^2}{2\|y_k^i\|_2^{2-p}}. \qquad (26)$$

On the other hand, formula (21) in Lemma 4.2 shows

    $$\sum_{i=1}^{m}\Big(\|y_{k+1}^i\|_2^p - \frac{p\|y_{k+1}^i\|_2^2}{2\|y_k^i\|_2^{2-p}}\Big) \le \sum_{i=1}^{m}\Big(\|y_k^i\|_2^p - \frac{p\|y_k^i\|_2^2}{2\|y_k^i\|_2^{2-p}}\Big). \qquad (27)$$

Combining inequalities (26) and (27), we have

    $$\sum_{i=1}^{m}\|y_{k+1}^i\|_2^p \le \sum_{i=1}^{m}\|y_k^i\|_2^p,$$

which is also $\|Y_{k+1}\|_{2,p}^p \le \|Y_k\|_{2,p}^p$. Thus Algorithm 4.1 generates a
monotonically decreasing sequence which converges to a KKT point of problem (16). Since
$0 < p < 1$, problem (16) is not a convex optimization problem. If $M$ has full-column
rank, the convergence point of $\{Y_k\}$ is a local minimum of (16). □

Remark 4.3. To some extent, Algorithm 4.1 offers an alternative for solving $\ell_p$
($0 < p < 1$) regularized problems when the number of columns in $Y$ is 1.

Remark 4.4. Algorithm 4.1 is a unified approach to solve problem (16) for any
$p \in (0,1]$. This scheme provides algorithmic support for adapting $p$ in $(0,1]$ to
improve the sparsity pattern for different data structures, regardless of convex or
nonconvex cases.



It is worth pointing out that Algorithm 4.1 can be easily extended to solve other
general $\ell_{2,p}$ ($p \in (0,1]$) regularized minimizations

    $$\min_Y f(Y) + \sum_t \|M_t Y + B_t\|_{2,p}^p \qquad (28)$$

by iteratively solving the equivalent form

    $$\min_Y f(Y) + \sum_t Tr\big((M_t Y + B_t)^T D_k (M_t Y + B_t)\big), \qquad (29)$$

where

    $$D_k = diag\Big(\frac{p}{2\|(M_t Y + B_t)^1\|_2^{2-p}}, \frac{p}{2\|(M_t Y + B_t)^2\|_2^{2-p}}, \dots\Big),$$

with one diagonal entry per row of $M_t Y + B_t$. In particular, consider

    $$\min_Y \|A^T Y - B\|_2^2 + \alpha \|Y\|_{2,p}^p. \qquad (30)$$

The lower bound of nonzero entries in solutions to problem (30) is expected to be
estimated from the theory in [?]. Such a result would be useful for enhancing practical
algorithms for solving problem (30).



5 Experimental Results 

We apply Algorithm 4.1 to feature selection in biological study. In our experiments,
four public data sets are used. A brief description of each data set is given as
follows.

ALLAML is Leukemia gene microarray data, originally obtained by Golub et al. [10].
There are 7129 genes, containing two classes: acute lymphocytic leukemia (ALL) and
acute myelogenous leukemia (AML).

GLIOMA contains four classes: cancer glioblastomas (CG), non-cancer glioblastomas
(NG), cancer oligodendrogliomas (CO) and non-cancer oligodendrogliomas (NO). There
are 50 samples in total and the classes have 14, 14, 7, and 15 samples respectively.
Each sample has 12625 genes.

LUNG cancer data is available at [11]. There are 12533 genes and 181 samples in total
in two classes: malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the
lung.

Prostate-GE data set has 12600 genes. There are 102 samples in two classes, tumor and
normal: 52 samples are tumor and 50 samples are normal. The data set is available in
[16].

All data sets first undergo the same preprocessing as in [?]. Then the data sets are
standardized to be zero-mean and normalized by standard deviation. To demonstrate the
effect of different $\ell_{2,p}$ matrix pseudo norms in feature selection, typical
$p \in (0,1]$ are tested by Algorithm 4.1. Here we implement $p = 0.25$, $0.5$, $0.75$ and
$1$ in $\ell_{2,p}$-norm based optimization problems. Using the top 20, 40, 60, and 80
features, SVM classifiers are individually trained on all data sets with 5-fold
cross-validation. The classification errors are reported in Tables 1 and 2.
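
The text does not spell out how the top features are extracted from the learned matrix.
A common convention for jointly sparse models, assumed here rather than taken from the
paper, is to rank feature $i$ by the $\ell_2$ norm of the $i$-th row of $X$. A minimal
sketch with a hypothetical helper `top_k_features` and a made-up matrix:

```python
import math

def top_k_features(X, k):
    """Indices of the k rows of X with the largest l2 norms, largest first.

    Assumption: row norms of the learned matrix serve as feature scores.
    """
    scores = [math.sqrt(sum(v * v for v in row)) for row in X]
    order = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical 4-feature, 2-output coefficient matrix.
X = [[0.0, 0.0], [3.0, 4.0], [0.1, 0.0], [1.0, 1.0]]
selected = top_k_features(X, 2)
print(selected)
```

Rows that the $\ell_{2,p}$ regularizer drives to zero receive the lowest scores, so they
are the first to be discarded.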



Table 1: Classification error (%) of different $\ell_{2,p}$ matrix norms

            Top 20 features                     Top 40 features
p =         0.25     0.5      0.75     1        0.25     0.5      0.75     1
ALLAML      6.86     4        6.67     5.43     5.52     4.1      5.52     4.1
GLIOMA      0        0        0        2        2        0        0        2
LUNG        3.94     1.98     3.46     2.95     1.46     1.46     1.46     1.96
Pro-GE      4.9      3.9      6.81     5.9      8.71     6.71     8.71     9.71
Average     3.925    2.47     4.235    4.07     4.4225   3.0675   3.9225   4.4425


Table 2: Classification error (%) of different $\ell_{2,p}$ matrix norms

            Top 60 features                     Top 80 features
p =         0.25     0.5      0.75     1        0.25     0.5      0.75     1
ALLAML      6.86     5.52     6.86     8.29     8.57     5.71     8.57     8.57
GLIOMA      2        2        2        4        4        2        2        4
LUNG        9.33     7.37     8.37     10.3     0.99     0.99     1.48     1.48
Pro-GE      8.71     6.71     8.71     9.71     5.86     3.95     5.9      5.9
Average     6.725    5.4      6.485    8.075    4.855    3.1625   4.4875   4.9875


The experimental procedure indicates that the four $\ell_{2,p}$-norm ($p = 0.25$, $0.5$,
$0.75$ and $1$) based minimizations do select different features, and hence result in
distinct classification performances. The parameter $p \in (0,1]$ in the $\ell_{2,p}$ matrix
norm balances the sparsity and nonconvexity of optimization problem (16). The closer
$p$ is to 0, the sparser the representation is, while if $p$ is near 1, the model is
almost convex. The classification error comparisons show that nonconvex $\ell_{2,p}$
($0 < p < 1$) matrix norms provide alternatives to the $\ell_{2,1}$-norm. In particular,
$p = 0.5$ empirically outperforms $p = 1$ in choosing a better sparse pattern in various
situations.

In order to validate the efficiency of the unified Algorithm 4.1 in solving nonconvex
$\ell_{2,p}$ ($0 < p < 1$) pseudo norm optimization problems as well as the convex
$\ell_{2,1}$-norm based minimization, we employ the relative reduction of the objective
function

    $$\rho_k = \frac{\|Y_k\|_{2,p}^p - \|Y_{k+1}\|_{2,p}^p}{\|Y_k\|_{2,p}^p}$$

to estimate the convergence speed. Actually, the convergence behaviors for each
$\ell_{2,p}$-norm case are similar. We display the change of $\rho_k$ with respect to the
iteration steps in the case of 80 features (see Figure 1). All experiments on the four
data sets uniformly reach the expected accuracy within around 20 steps.
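
A convergence monitor of this kind takes only a few lines. The sketch below assumes
that $\rho_k$ compares each objective value with the next one; the function names, the
tolerance, and the sample objective values are hypothetical.

```python
def relative_reduction(objective_values):
    """rho_k = (f_k - f_{k+1}) / f_k for consecutive objective values f_k > 0."""
    return [(prev - cur) / prev
            for prev, cur in zip(objective_values, objective_values[1:])]

def first_step_below(rho, tol):
    """Index of the first rho_k that falls below tol, or None if none does."""
    return next((k for k, r in enumerate(rho) if r < tol), None)

# Made-up objective values of four consecutive iterates.
values = [4.0, 2.0, 1.5, 1.5]
rho = relative_reduction(values)
print(rho)
print(first_step_below(rho, 1e-3))
```

Stopping once $\rho_k$ drops below a tolerance is one natural reading of "until
convergence" in the algorithm, although the paper does not state the criterion it used.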



Figure 1: The convergence performance of four $\ell_{2,p}$-norm based minimizations
(panel: ALLAML)



6 Conclusions 

In this paper, a family of general $\ell_{2,p}$ matrix norms is proposed for use in jointly
sparse optimization problems. A unified algorithm is designed to solve the mixed
$\ell_{2,p}$-norm ($p \in (0,1]$) based sparse model, and its convergence is ensured
uniformly. Experimental results on gene expression data sets validate the unified
performance of the proposed method. Meanwhile, this approach provides more choices of
$p \in (0,1]$ to fit a variety of jointly sparse structures.